In this workshop, we will use simulated climate data based on real data from the Bureau of Meteorology at Canterbury Racecourse AWS {station 066194} collected in 2023. The simulated data contains several different daily measurements throughout Autumn (March-May).
We will use the following variables.
X3pm.temperature (daily temperature measured at
3 pm)Maximum.temperature (maximum daily
temperature)X9am.temperature (daily temperature measured at
9 am)Minimum.temperature (minimum daily
temperature)The temperature data are measured in Celsius. Please beware that the variable names are case sensitive.
Task 1: Download the data file
AutumnCleaned.csv in your data folder within
your STAT5002 folder. Then import the csv
file into a variable called data:
### write your code here.
data = read.csv("data/AutumnCleaned.csv", header = T)
### the following displays the dimension of the data
dim(data)
## [1] 92 16
names(data)
## [1] "X" "Minimum.temperature"
## [3] "Maximum.temperature" "Rainfall"
## [5] "Evaporation" "Sunshine..hours."
## [7] "Direction.of.maximum.wind.gust" "Speed.of.maximum.wind.gust"
## [9] "X9am.temperature" "X9am.relative.humidity"
## [11] "X9am.wind.direction" "X9am.wind.speed"
## [13] "X3pm.temperature" "X3pm.relative.humidity"
## [15] "X3pm.wind.direction" "X3pm.wind.speed"
Task 2: How many observations are
there? How many variables are there?
Answer: There are 92 observations and
16 variables.
### This is for internal use/solution only
T3pm = data$X3pm.temperature
T9am = data$X9am.temperature
Tmax = data$Maximum.temperature
Tmin = data$Minimum.temperature
Generate a scatter plot for the the daily temperature observed at 3
pm (X3pm.temperature) and the observed daily maximum
temperature (Maximum.temperature), with
X3pm.temperature (\(X\))
on the horizontal axis and Maximum.temperature (\(Y\)) on the vertical axis.
Generate another scatter plot for the daily temperature observed at 9
am (X9am.temperature) and the observed daily minimum
temperature (Minimum.temperature), with
X9am.temperature (\(X\))
on the horizontal axis and Minimum.temperature (\(Y\)) on the vertical axis.
Comment on and compare these associations based on the scatter plot.
### Write your code here
###
par(mfrow=c(1,2))
plot(T3pm, Tmax, xlab="X: 3pm temperature", ylab="Y: max temperature")
plot(T9am, Tmin, xlab="X: 9am temperature", ylab="Y: min temperature")
Answer: There is a strong positive
association between X3pm.temperature and Maximum.temperature. There is
also a positive association between X9am.temperature and
Minimum.temperature. The association between X3pm.temperature and
Maximum.temperature is stronger than that between X9am.temperature and
Minimum.temperature.
Using the function lm(), build a linear model for
predicting daily maximum temperature (Maximum.temperature)
given the temperature observed at 3 pm (X3pm.temperature).
Generate a scatter plot for X3pm.temperature and
Maximum.temperature. Plot the resulting regression line on
top of the scatter plot using abline(). Predict the value
of daily maximum temperature given a value of \(X=33\) for the temperature observed at 3
pm.
Use the function points(X, Y, col="red", cex=3, pch=19)
to plot the predict value \(Y\)
(together with the predictor \(X\)),
where the options col="red", cex=3, pch=19 specify the
color, the marker size, and the mark type, respectively.
### Write your code here
l = lm(Tmax~T3pm)
plot(T3pm, Tmax, xlab="X: 3pm temperature", ylab="Y: max temperature")
abline(l, col="blue")
X = 33
Y = l$coefficients[1] + l$coefficients[2]*X
Y
## (Intercept)
## 34.7929
points(X,Y, col="red", cex=3, pch=19)
# or the alternative, not taught in class
pred = predict(l, data.frame(T3pm = c(33)))
pred
## 1
## 34.7929
Generate the residual plot of the linear model built above. Comment on if the regression line is a good fit.
### Write your code here
plot(T3pm, l$residuals, xlab="X: 3pm temperature", ylab="Y: residual")
Answer: Here the key is to understand
that the changing variance of the residual (homoskedasticity or
heteroskedasticity) should be used to justify a linear model. Here are
some ideas. In this case, the regression line is a reasonable fit as the
residual plot does not show a significant trend of changing variances.
Note that although the data has changing density over X, but the
variation of the residual remain constant.
Compared to the baseline prediction, what percentage of variation in
the response variable Maximum.temperature can be explained
by the linear regression model fitted above.
### Write your code here
cor(T3pm, Tmax)^2
## [1] 0.8957635
Answer: Compared to the baseline
prediction, about 89.58% variations in the response variable can be
explained by the linear regression model.
Recall that the coefficient of determination of the regression line is \[ 1 - \frac{\text{SSE}}{\text{SST}}, \quad \text{SSE} = \sum_{i = 1}^n (y_i - a - b x_i)^2, \quad \text{SST} = \sum_{i = 1}^n (y_i - \bar{y})^2 \] where \(a\) and \(b\) are the intercept and the slope of the regression line, respectively. We want to verify that the coefficient of determination can be indeed be written as squared correlation coefficient (\(r^2\)) in R.
Task 1: Compute the sum of squared
residuals/errors (SSE) of the fitted linear model and the total sum of
squares of the dependent variable X3pm.temperature.
### Write your code here
SSE = sum(l$residuals^2)
SST = sum((Tmax - mean(Tmax))^2)
Task 2: Compute the coefficient of
determination using the above formula, and compare it with \(r^2\), do they agree?
### Write your code here
1 - SSE/SST
## [1] 0.8957635
cor(Tmax,T3pm)^2
## [1] 0.8957635
In an experiment, three fair dice are thrown independently. What is the chance of getting a total equal to 6?
Write R code to simulate 1000 experiments and report your result.
You should compare your result with other students, what do you observe?
set.seed(23)
m = 1000
totals=sample(1:6, m, rep = T)+sample(1:6, m, rep = T)+sample(1:6, m, rep = T)
table(totals)/m
## totals
## 3 4 5 6 7 8 9 10 11 12 13 14 15
## 0.007 0.011 0.025 0.048 0.066 0.105 0.112 0.135 0.134 0.113 0.094 0.065 0.041
## 16 17 18
## 0.029 0.013 0.002
barplot(table(totals), main="1000 rolls: sum of 3 dice")
Answer: the chance of getting a total
equal to 6 is approximately 0.046. Your results may differ. We will
learn why we observe such variations next week.